Method and Apparatus For Segmentation of Audio Interactions

ABSTRACT

A method and apparatus for segmenting an audio interaction by locating an anchor segment for each side of the interaction, iteratively classifying additional segments into one of the two sides, and scoring the resulting segmentation. If the score is below a threshold, the process is repeated until the segmentation score is satisfactory or until a stopping criterion is met. The anchoring and scoring steps comprise using additional data associated with the interaction or a speaker thereof, or internal or external information related to the interaction or to a speaker thereof, or the like.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention relates to audio analysis in general, and to a method and apparatus for segmenting an audio interaction in particular.

2. Discussion of the Related Art

Audio analysis refers to the extraction of information and meaning from audio signals for purposes such as word statistics, trend analysis, quality assurance, and the like. Audio analysis could be performed in audio-interaction-extensive working environments, such as, for example, call centers, financial institutions, health organizations, public safety organizations or the like. Typically, audio analysis is used in order to extract useful information associated with or embedded within captured or recorded audio signals carrying interactions. Audio interactions contain valuable information that can provide enterprises with insights into their business, users, customers, activities and the like. The extracted information can be used for issuing alerts, generating reports, sending feedback or otherwise using the extracted information. The information can be usefully manipulated and processed, such as being stored, retrieved, synthesized, combined with additional sources of information, and the like. Extracted information can include, for example, continuous speech, spotted words, identified speakers, extracted emotional (positive or negative) segments within an interaction, data related to the call flow such as the number of bursts from each side, segments of mutual silence, or the like. The customer side of an interaction recorded in a commercial organization can be used for various purposes such as trend analysis, competitor analysis, and emotion detection (finding emotional calls) to improve customer satisfaction level, and the like. The service provider side of such interactions can be used for purposes such as script adherence and emotion detection (finding emotional calls) to track deficient agent behavior, and the like.

The most common interaction recording format is summed audio, which is the product of analog line recording, observation mode and legacy systems. A summed interaction may include, in addition to two or more speakers that at times may talk simultaneously (co-speakers), also music, tones, background noises on either side of the interaction, or the like. The audio analysis performance, as measured in terms of accuracy, detection, real-time efficiency and resource efficiency, depends directly on the quality and integrity of the captured and/or recorded signals carrying the audio interaction, on the availability and integrity of additional meta-information, on the capabilities of the computer programs that constitute the audio analysis process and on the available computing resources. Many of the analysis tasks are highly sensitive to the audio quality of the processed interactions. Multiple speakers, as well as music (which is often present during hold periods), tones, background noises such as street noise, ambient noise, convolutional noises such as channel type and handset type, keystrokes and the like, severely degrade the performance of these engines, sometimes to the degree of complete uselessness, for example in the case of emotion detection where it is mandatory to analyze only one speaker's speech segments. Therefore it is crucial to identify only the speech segments of an interaction wherein a single speaker is speaking. The customary solution is to use an unsupervised speaker segmentation module as part of the audio analysis.

Traditionally, unsupervised speaker segmentation algorithms are based on bootstrap (bottom-up) classification methods, starting with short discriminative segments and extending such segments using additional, not necessarily adjacent segments. Initially, a homogenous speaker segment is located and regarded as an anchor. The anchored segment is used for initially creating a model of the first speaker. In the next phase a second homogenous speaker segment is located, in which the speaker characteristics are most different from the first segment. The second segment is used for creating a model of the second speaker. By deploying an iterative maximum-likelihood (ML) classifier based on the anchored speaker models, all other utterance segments can be roughly classified. The conventional methods suffer from a few limitations: the performance of the speaker segmentation algorithm is highly sensitive to the initial phase, i.e., a poor choice of the initial segment (anchored segment) can lead to unreliable segmentation results. Additionally, the methods do not provide a verification mechanism for assessing the success of the segmentation, nor the convergence of the methods, in order to eliminate poorly segmented interactions from being further processed by audio analysis tools and providing further inaccurate results. Another drawback is that additional sources of information, such as computer-telephony-integration (CTI) data, screen events and the like are not used. Yet another drawback is the inability of the methods to tell which collection of segments belongs to one speaking side, such as the customer, and which belongs to the other speaking side, since different analyses are performed on the two sides, to serve different needs.

It should be easily perceived by one with ordinary skill in the art that there is an obvious need for an unsupervised segmentation method and apparatus to segment an unconstrained interaction into segments that should not be analyzed, such as music, tones, low quality segments or the like, and segments carrying speech of a single speaker, where segments of the same speaker should be grouped or marked accordingly. Additionally, identifying the sides of the interaction is required. The segmentation tool has to be effective, i.e., extract as long and as many as possible segments of the interaction in which a single speaker is speaking, with as little as possible compromise on the reliability, i.e., the quality of the segments. Additionally, the tool should be fast and efficient, so as not to introduce delays to further processing, or place additional burden on the computing resources of the organization. It is also required that the tool provide a performance estimation which can be used in deciding whether the speech segments are to be sent for analysis or not.

SUMMARY OF THE PRESENT INVENTION

It is an object of the present invention to provide a novel method for speaker segmentation which overcomes the disadvantages of the prior art. In accordance with the present invention, there is thus provided a speaker segmentation method for associating one or more segments for each of two or more sides of one or more audio interactions with one of the sides of the interaction using additional information, the method comprising: a segmentation step for associating the one or more segments with one side of the interaction, and a scoring step for assigning a score to said segmentation. The additional information can be one or more of the group consisting of: computer-telephony-integration information related to the at least one interaction; spotted words within the at least one interaction; data related to the at least one interaction; data related to a speaker thereof; external data related to the at least one interaction; or data related to at least one other interaction performed by a speaker of the at least one interaction. The method can further comprise a model association step for scoring the segments against one or more statistical models of one side, and obtaining a model association score. The scoring step can use discriminative information for discriminating the two or more sides of the interaction. The scoring step can comprise a model association step for scoring the segments against a statistical model of one side, and obtaining a model association score. Within the method, the scoring step can further comprise a normalization step for normalizing the one or more model scores. The scoring step can also comprise evaluating the association of the one or more segments with a side of the interaction, using additional information. The additional information can be one or more of the group consisting of: computer-telephony-integration information related to the at least one interaction; spotted words within the at least one interaction; data related to the at least one interaction; data related to a speaker thereof; external data related to the at least one interaction; or data related to at least one other interaction performed by a speaker of the at least one interaction. The scoring step can comprise statistical scoring. The method can further comprise: a step of comparing the score to a threshold; and repeating the segmentation step and the scoring step if the score is below the threshold. The threshold can be predetermined, or dynamic, or depend on: information associated with said at least one interaction, information associated with an at least one speaker thereof, or external information associated with the interaction. The segmentation step can comprise a parameterization step to transform the speech signal to a set of feature vectors in order to generate data more suitable for statistical modeling; an anchoring step for locating an anchor segment for each side of the interaction; and a modeling and classification step for associating at least one segment with one side of the interaction. The anchoring step or the modeling and classification step can comprise using additional data, wherein the additional data is one or more of the group consisting of: computer-telephony-integration information related to the at least one interaction; spotted words within the at least one interaction; data related to the at least one interaction; data related to a speaker thereof; external data related to the at least one interaction; or data related to at least one other interaction performed by a speaker of the at least one interaction. The method can comprise a preprocessing step for enhancing the quality of the interaction, or a speech/non-speech segmentation step for eliminating non-speech segments from the interaction. The segmentation step can comprise scoring the one or more segments with a voice model of a known speaker.

Another aspect of the disclosed invention relates to a speaker segmentation apparatus for associating one or more segments for each of two or more speakers participating in one or more audio interactions with a side of the interaction, using additional information, the apparatus comprising: a segmentation component for associating one or more segments within the interaction with one side of the interaction; and a scoring component for assigning a score to said segmentation. Within the apparatus the additional information can be of the group consisting of: computer-telephony-integration information related to the at least one interaction; spotted words within the at least one interaction; data related to the at least one interaction; data related to a speaker thereof; external data related to the interaction; or data related to one or more other interactions performed by a speaker of the interaction.

Yet another aspect of the disclosed invention relates to a quality management apparatus for interaction-rich environments, the apparatus comprising: a capturing or logging component for capturing or logging one or more audio interactions; a segmentation component for segmenting the interactions; and a playback component for playing one or more parts of the one or more audio interactions.

BRIEF DESCRIPTION OF THE DRAWINGS

The present invention will be understood and appreciated more fully from the following detailed description taken in conjunction with the drawings in which:

FIG. 1 is a schematic block diagram of a typical environment in which the disclosed invention is used, in accordance with a preferred embodiment of the present invention;

FIG. 2 is a schematic flowchart of the disclosed segmentation method, in accordance with a preferred embodiment of the present invention; and

FIG. 3 is a schematic flowchart of the scoring process, in accordance with a preferred embodiment of the present invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

The present invention overcomes the disadvantages of the prior art by providing a novel method and a system for locating segments within an audio interaction in which a single speaker is speaking, dividing the segments into two or more groups, wherein the speaker in each segment group is the same one, and discriminating in which group of segments a certain participant, or a certain type of participant, such as a service representative (agent) of an organization, is speaking, and in which group another participant or participant type, such as a customer, is speaking. The disclosed invention utilizes additional types of data collected in interaction-intensive environments, such as call centers, financial institutions or the like, in addition to captured or recorded audio interactions, in order to enhance the segmentation and the association of a group of segments with a specific speaker or speaker type, such as an agent, a customer or the like. The discussion below is oriented more to applications involving commerce or service, but the method is applicable to any required domain, including public safety, financial organizations such as trade floors, health organizations and others.

The information includes raw information, such as meta data, as well as information extracted by processing the interactions. Raw information includes, for example, Computer Telephony Integration (CTI) information, which includes hold periods, number called, number called from, DNIS, VDN, ANI or the like, agent details, screen events related to the current or other interactions with the customer, information exchanged between the parties, and other relevant information that can be retrieved from external sources such as CRM data, billing information, workflow management, mail messages and the like. The extracted information can include, for example, certain words spotted within the interaction, such as greetings, compliance phrases or the like, continuous speech recognition, emotion detected within an interaction, and call flow information, such as bursts of one speaker when the other speaker is talking, mutual silence periods and others. Other data used includes, for example, voice models of a single or multiple speakers.
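By way of illustration only, the following Python sketch shows one possible way of bundling such additional information with an interaction so that it is available to the segmentation and scoring steps described below; the field names are assumptions made for this example and are not taken from the description above.

```python
# Illustrative container for the additional, non-audio information
# accompanying an interaction; all field names are assumptions.
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

@dataclass
class InteractionMetadata:
    called_number: Optional[str] = None        # CTI: number called / DNIS
    calling_number: Optional[str] = None       # CTI: number called from / ANI
    hold_periods_sec: List[float] = field(default_factory=list)
    agent_id: Optional[str] = None
    screen_events: List[str] = field(default_factory=list)                # e.g. "account_form_opened"
    spotted_words: List[Tuple[float, str]] = field(default_factory=list)  # (time_sec, phrase)
    emotion_segments: List[Tuple[float, float, float]] = field(default_factory=list)  # (start, end, level)
```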

The collected data is used in the process of segmenting the audio interaction in a number of ways. First, the information can be used to obtain an accurate anchor point for the initial selection of a segment of a single speaker. For example, a segment in which a compliance phrase was spotted can be a good anchor point for one speaker, specifically the agent. A highly emotional segment can be used as an anchor for the customer side. Such information can be used during the classification of segments into speakers, and also for a posteriori assessment of the performance of the segmentation. Second, the absence or presence, and certainty level, of specific events within the segments of a certain speaker can contribute to the discrimination of the agent side from the customer side, and also to assessing the performance of the segmentation. For example, the presence of compliance sentences and typical customer-side noises (such as a barking dog) in segments of allegedly the same speaker can suggest a deficient segmentation. The discrimination of the speakers can be enhanced by utilizing agent-customer-discriminating information, such as screen events, emotion levels, and voice models of a specific agent, a specific customer, a group of agents, a universal agent model or a universal customer model. If segments attributed to one side have a high probability of complying with a specific agent's characteristics or with a universal agent model, relating the segments to the agent side will have a higher score, and vice versa. Thus, the segmentation can be assessed, and according to the assessment result accepted, rejected, or repeated.

Referring now to FIG. 1, which presents a block diagram of the main components in a typical environment in which the disclosed invention is used. The environment, generally referenced as 10, is an interaction-rich organization, typically a call center, a bank, a trading floor, another financial institute, a public safety contact center, or the like. Customers, users or other contacts are contacting the center, thus generating input information of various types. The information types include vocal interactions, non-vocal interactions and additional data. The capturing of voice interactions can employ many forms and technologies, including trunk side, extension side, summed audio, separate audio, and various encoding and decoding protocols such as G729, G726, G723.1, and the like. The vocal interactions usually include telephone 12, which is currently the main channel for communicating with users in many organizations. The voice typically passes through a PABX (not shown), which in addition to the voice of the two or more sides participating in the interaction collects additional information discussed below. A typical environment can further comprise voice over IP channels 16, which possibly pass through a voice over IP server (not shown). The interactions can further include face-to-face interactions, such as those recorded in a walk-in center 20, and additional sources of vocal data 24, such as a microphone, intercom, the audio of video capturing, vocal input by external systems or any other source. In addition, the environment comprises additional non-vocal data of various types 28. For example, Computer Telephony Integration (CTI) used in capturing the telephone calls can track and provide data such as number and length of hold periods, transfer events, number called, number called from, DNIS, VDN, ANI, or the like. Additional data can arrive from external sources such as billing, CRM, or screen events, including text entered by a call representative, documents and the like. The data can include links to additional interactions in which one of the speakers in the current interaction participated. Another type of data includes data extracted from vocal interactions, such as spotted words, emotion level, speech-to-text or the like.

Data from all the above-mentioned sources and others is captured and preferably logged by capturing/logging unit 32. The captured data is stored in storage 34, comprising one or more of a magnetic tape, a magnetic disc, an optical disc, a laser disc, a mass-storage device, or the like. The storage can be common or separate for different types of captured interactions and different types of additional data. Alternatively, the storage can be remote from the site of capturing and can serve one or more sites of a multi-site organization such as a bank. Capturing/logging unit 32 comprises a computing platform running one or more computer applications as is detailed below. From capturing/logging unit 32, the vocal data and preferably the additional relevant data are transferred to segmentation component 36, which executes the actual segmentation of the audio interaction. Segmentation component 36 transfers the output segmentation to scoring component 38, which assigns a score to the segmentation. If the score exceeds a certain threshold, the segmentation is accepted. If the score is below the threshold, another activation of the segmentation is attempted. The scoring and segmentation sequence is repeated until an acceptable score is achieved, or a stopping criterion is met.
The threshold can be predetermined, or it can be set dynamically, taking into account the interaction type, one or more of the speakers if known, additional data such as Computer-Telephony-Integration (CTI) data, CRM or billing data, data associated with any of the speakers, screen events or the like. For example, the system can assign a higher threshold to an interaction of a VIP customer than to an interaction of an ordinary customer, or a higher threshold for interactions involving opening an account or the like.

It is obvious that if the audio content of interactions, or some of the interactions, is recorded as summed, then speaker segmentation has to be performed. However, even when the audio interactions are recorded separately for each side, as is usually the case in trunk-side or digital extension recording, there is still segmentation work to be done. Separating speech from non-speech is required in order to obtain fluent speech segments, by excluding segments of music, tones, significant background noise, low quality or the like. In addition, there might still be effects of echo, background speech on either side, the customer consulting a third person, or the like, which require the segmentation and association of single-speaker segments with one speaker. The segmented audio can assume the form of separate audio streams or files for each side, the form of the original stream or file accompanied by indexing information denoting the beginning and end of each segment in which a certain side of the interaction is speaking, or any other form. The segmented audio is preferably transferred to further engines 40, such as a speech-to-text engine, emotion detection, speaker recognition, or other voice processing engines. Alternatively, the segmentation information or the segmented voice is transferred for storage purposes 44. In addition, the information can be transferred to any other purpose or component 48, such as, but not limited to, a playback component for playing the captured or segmented audio interactions.

All components of the system, including capturing/logging components 32 and segmentation component 36, preferably comprise one or more computing platforms, such as a personal computer, a mainframe computer, or any other type of computing platform that is provisioned with a memory device (not shown), a CPU or microprocessor device, and several I/O ports (not shown). Alternatively, each component can be a DSP chip, an ASIC device storing the commands and data necessary to execute the methods of the present invention, or the like. Each component can further include a storage device (not shown), storing the relevant applications and data required for processing. Each application running on each computing platform, such as the capturing applications or the segmentation application, is a set of logically inter-related computer programs or modules and associated data structures that interact to perform one or more specific tasks. All applications can be co-located and run on the same one or more computing platforms, or on different platforms. In yet another alternative, the information sources and capturing platforms can be located at each site of a multi-site organization, and one or more segmentation components can be remotely located, segment interactions captured at one or more sites, and store the segmentation results in a local, central, distributed or any other storage.
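As a hedged illustration only, the following Python sketch shows how the segment-score-retry loop with a dynamically set threshold, as described above, might be organized; the rule values, category names and the bounded-retry stopping criterion are assumptions for the example, not values prescribed by the invention.

```python
# Illustrative segmentation/scoring control loop with a dynamically set threshold.
def segmentation_threshold(customer_type, interaction_type, base=0.6):
    threshold = base
    if customer_type == "VIP":
        threshold += 0.1            # demand a better segmentation for VIP customers
    if interaction_type == "account_opening":
        threshold += 0.1            # and for sensitive interaction types
    return min(threshold, 0.95)

def segment_with_retries(interaction, metadata, segment_fn, score_fn, max_attempts=3):
    threshold = segmentation_threshold(metadata.get("customer_type", "regular"),
                                       metadata.get("interaction_type", "general"))
    for attempt in range(max_attempts):          # stopping criterion: bounded number of retries
        segmentation = segment_fn(interaction, attempt)
        if score_fn(segmentation) >= threshold:
            return segmentation                  # segmentation accepted
    return None                                  # segmentation rejected
```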

Referring now to FIG. 2, showing a flowchart of the main steps in the proposed speaker segmentation method. Summed audio, as well as additional data such as CTI data, screen events, spotted words, and data from external sources such as CRM, billing, or the like, is introduced at step 104 to the system. The summed audio can use any format and any compression method acceptable by the system, such as PCM, WAV, MP3, G729, G726, G723.1, or the like. The audio can be introduced in streams, files, or the like. At step 108, preprocessing is performed on the audio, in order to enhance the audio for further processing. The preprocessing preferably includes decompression, according to the compression used in the specific interaction. If the audio is from an external source, the preprocessing can include compression and decompression with one of the protocols used in the environment in order to adapt the audio to the characteristics common in the environment. The preprocessing can further include low-quality segment removal or other processing that will enhance the quality of the audio. Step 110 marks, removes or otherwise eliminates non-speech segments from the audio. Such segments can include music, tones, DTMF, silence, segments with significant background noise or other substantially non-speech segments. Preprocessing step 108 and speech/non-speech segmentation step 110 are optional, and can be dispensed with. However, the performance in time, computing resources and the quality of the speaker segmentation will degrade if step 108 or step 110 is omitted.

The enhanced audio is then transferred to segmentation step 112. Segmentation step 112 comprises a parameterization step 118, an anchoring step 120 and a modeling and classification step 124. At step 118 the speech is parameterized by transforming the speech signal into a set of feature vectors. The purpose of this transformation is to obtain a new representation which is more compact, less redundant and more suitable for statistical modeling. Most speaker segmentation systems depend on a cepstral representation of speech in addition to prosodic parameters such as pitch, pitch variance, energy level and the like. The parameterization generates a sequence of feature vectors, wherein each vector relates to a certain time frame, preferably in the range of 10-30 ms, where the speech can be regarded as stationary. In another alternative method, the parameterization step is performed earlier, as part of preprocessing step 108 or speech/non-speech segmentation step 110. Also at step 118, the speech signal is divided into non-overlapping segments, typically, but not limited to, having a duration of 1-3 seconds.

The speaker segmentation main process starts at step 120, during which anchor segments are located within the audio interaction. Preferably, the method searches for two segments to be used as anchor segments, and each of the two segments should contain speech of a different speaker. Each anchor segment will be used for initial voice modeling of the speaker it represents. Finding the first anchor segment is preferably performed by statistical modeling of every segment in the interaction and then by locating the most homogenous segment in terms of statistical voice feature distribution. Such a segment is more likely to be a segment in which a single speaker is speaking rather than an area of transition between two speakers. This segment will be used for building the first speaker's initial voice model.
Locating such a first segment can also involve utilizing additional data, such as CTI events; for example, the first speaker in a call center interaction is likely to be the agent addressing the customer. Alternatively, spotting with high certainty standard phrases which agents are instructed to use, such as “company X good morning, how can I help you”, can help identify an anchor segment for the agent side, and standard questions, such as “how much would it cost to”, can help in locating homogenous segments of the customer side. Once the first anchor segment is determined, the method constructs a statistical model of the voice features in that segment, where the statistical model represents the voice characteristics of the first speaker. Subsequently, the method searches for a second anchor segment whose statistical model is as different as possible from the statistical model of the first anchor; the distance is measured and quantified by some statistical distance function, such as a likelihood ratio test. The aim of the second anchor finding is to find an area in the interaction which is most likely produced by a different statistical source, i.e., a different speaker. Alternatively, if the agent (or the customer) is known and a voice model of the agent has previously been built using other voice samples of the speaker, or can be otherwise obtained, locating the segments of the agent can be done by searching for all segments which comply with the specific agent model, and continuing by associating all the rest of the speech segments with the customer (or agent) side.

Once the two anchor segments are determined, the system goes into the modeling and classification step 124. Step 124 comprises an iterative process. On each iteration, a statistical model is constructed from the aggregated segments identified so far as belonging to each speaker. Then the distance between each segment in the interaction and the speakers' voice models is measured and quantified. The distance can be produced by likelihood calculation or the like. Next, one or more segments which are most likely to come from the same statistical distributions as the speakers' statistical models, i.e., produced by the same speaker, are added to the similar speaker's pool of segments from the previous iteration. On the next iteration, the statistical models are reconstructed, utilizing the newly added segments as well as the previous ones, and new segments to be added are searched for. The iterations proceed until one or more stopping criteria are met, such as the distance between the model and the most similar segment exceeding a certain threshold, the length of the added segments being below a certain threshold, or the like. During modeling and classification step 124, soft classification techniques can also be applied in determining the similarity between a segment and a statistical model, or when calculating whether a stop criterion is met. Once the modeling and classification is done, scoring step 128 takes place. Scoring step 128 assigns a score to the segmentation result. If the score is below a predetermined threshold, the performance is unsatisfactory and the process repeats, restarting from step 120, excluding the former first and second anchor segments, or from step 118, using different voice features. The threshold can be predetermined, or it can be set dynamically, taking into account the interaction type, other data related to the interaction, additional data such as CTI data, external data such as CRM or billing data, data associated with any of the speakers, screen events or the like.
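The following Python sketch is one possible, simplified realization of parameterization step 118 and of the anchoring and iterative modeling and classification steps 120 and 124 described above, assuming the librosa and scikit-learn libraries; the homogeneity measure, distance measure and stopping test shown are illustrative simplifications and not the particular statistical tests employed by the invention.

```python
# Illustrative sketch of parameterization (step 118), anchoring (step 120)
# and iterative modeling/classification (step 124). Example values only.
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

def parameterize(wav_path, frame_ms=25, hop_ms=10, segment_sec=2.0):
    """Cepstral feature vectors per ~10 ms frame, cut into ~2 s segments."""
    signal, sr = librosa.load(wav_path, sr=8000, mono=True)     # summed telephony audio
    hop, n_fft = int(sr * hop_ms / 1000), int(sr * frame_ms / 1000)
    mfcc = librosa.feature.mfcc(y=signal, sr=sr, n_mfcc=13,
                                hop_length=hop, n_fft=n_fft).T  # (frames, 13)
    frames_per_segment = int(segment_sec * 1000 / hop_ms)
    return [mfcc[i:i + frames_per_segment]
            for i in range(0, len(mfcc), frames_per_segment)]

def fit_model(frames, n_components=4):
    return GaussianMixture(n_components=n_components,
                           covariance_type="diag", random_state=0).fit(frames)

def segment_speakers(segments, min_avg_loglik=None):
    # Anchor 1: the most homogeneous segment (lowest mean feature variance here,
    # a simple stand-in for the statistical homogeneity criterion described above).
    anchor1 = int(np.argmin([np.var(s, axis=0).mean() for s in segments]))
    model1 = fit_model(segments[anchor1])
    # Anchor 2: the segment least likely under the first speaker's model.
    anchor2 = int(np.argmin([model1.score(s) for s in segments]))
    model2 = fit_model(segments[anchor2])

    pools = {0: [anchor1], 1: [anchor2]}
    unassigned = set(range(len(segments))) - {anchor1, anchor2}
    while unassigned:
        # Pick the unassigned segment closest (in likelihood) to either speaker model.
        scored = [(i, model1.score(segments[i]), model2.score(segments[i])) for i in unassigned]
        i, s1, s2 = max(scored, key=lambda t: max(t[1], t[2]))
        if min_avg_loglik is not None and max(s1, s2) < min_avg_loglik:
            break                     # stopping criterion: remaining segments too distant
        pools[0 if s1 >= s2 else 1].append(i)
        unassigned.remove(i)
        # Rebuild each speaker model from its aggregated pool of segments.
        model1 = fit_model(np.vstack([segments[j] for j in pools[0]]))
        model2 = fit_model(np.vstack([segments[j] for j in pools[1]]))
    return pools
```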
The stopping condition for the segmentation can be defined in a predetermined manner, such as “try at most X times, and if the segmentation does not succeed, skip the interaction and segment another one”. Alternatively, the stopping criteria can be defined dynamically, for example, “continue the segmentation as long as there are still segments for which no segment X or fewer seconds apart from them has been used as an anchor segment”. If the segmentation score exceeds the predetermined threshold, the results are output at step 144. The scoring process is detailed in association with FIG. 3 below. The results output at step 144 can take any required form. One preferred form is a file or stream containing text, denoting the start and end locations of each segment, for example in terms of time units from the beginning of the interaction, and the associated speaker. The output can also comprise start and end locations for segments of an unknown speaker, or for non-speech segments. Another preferred form comprises two or more files, wherein each file comprises the segments of one speaker. The non-speech or unknown-speaker segments can be ignored, or can reside in a separate file for purposes such as playback.
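As a concrete illustration of the first output form mentioned above (text denoting start and end locations with the associated speaker), the following hypothetical Python snippet writes one tab-separated line per segment; the file layout and labels are assumptions for the sake of the example.

```python
# Illustrative segmentation output: start time, end time (seconds) and side label.
def write_segmentation(path, labeled_segments):
    # labeled_segments: iterable of (start_sec, end_sec, label),
    # where label is e.g. "agent", "customer", "unknown" or "non-speech".
    with open(path, "w") as out:
        for start, end, label in labeled_segments:
            out.write(f"{start:.2f}\t{end:.2f}\t{label}\n")

write_segmentation("call_0001.seg",
                   [(0.00, 2.10, "agent"), (2.10, 5.70, "customer"), (5.70, 7.00, "non-speech")])
```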

Referring now to FIG. 3, showing the main steps in the scoring assessment process referred to in step 140 of FIG. 2. The scoring step comprises two main parts, assessing a statistical score and an agent-customer discrimination score. The statistical score determined at step 204 is based on determining the distance between the model generated from the segments attributed to one side and the model generated from the segments attributed to the other side. If the distance between the models is above a predetermined threshold, then the segments attributed to one side are significantly different from the segments attributed to the other side, and the classification is considered successful. If the distance is below a predetermined threshold (not necessarily equal to the predetermined threshold mentioned above), the segments attributed to different speakers are not distinctive enough, and the classification is assumed to be unsuccessful. However, the statistical score can be problematic, since the model-distance determination is calculated using the same tools and principles used when assigning segments to a certain speaker during the classification step. Therefore, the segmentation step and the testing step use the same data and the same calculations, which makes the examination biased and less reliable.

Discriminative scoring step 208 uses discriminative information, such as discriminative customer-agent information, in order to assess the success of the speaker segmentation process, and to determine or verify the association of each segment group with a specific speaker. Discriminative scoring step 208 is divided into model association step 212 and additional information scoring step 216. Model association step 212 uses previously built or otherwise acquired universal models of agents and of customers. The universal agent model is built from speech segments in which multiple agents of the relevant environment are speaking, using the same types of equipment used in the environment. The universal customer model is built from multiple segments of customers using various types of equipment, including land lines, cellular lines, various handsets, various types of typical customer background noise and the like. The model preferably incorporates both male and female customers if customers of both genders are likely to speak in real interactions, customers of relevant ages, accents and the like. If the speaker segmentation includes side (agent/customer) association, step 212 is used for verification of the association; otherwise step 212 is used for associating each segment group with a specific side. In model association step 212, the speech segments attributed to each side are preferably scored against the universal agent model in step 220, and against the universal customer model in step 224, thus obtaining two model association scores. The two model association scores are normalized in normalization step 228. If one segment group was assigned, for example, to an agent, and indeed the normalized score against the universal agent model yielded a significantly higher score than the scoring against the universal customer model, the association of the segment group with the agent side is reinforced. However, if the score of an agent-assumed segment group against the customer model is higher than the score against the general agent model, this might indicate a problem either in the segmentation or in the side association.
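A minimal sketch of model association steps 220-228, assuming pre-trained universal agent and customer models (for example Gaussian mixtures such as those used in the earlier sketch): each side's pooled feature frames are scored against both universal models and the two scores are reduced to a single normalized value. The log-likelihood-ratio normalization shown is an assumption; other normalization schemes are equally possible.

```python
# Illustrative model association scoring against universal agent/customer models.
def model_association_score(side_frames, universal_agent, universal_customer):
    agent_ll = universal_agent.score(side_frames)        # average log-likelihood per frame
    customer_ll = universal_customer.score(side_frames)
    # Normalized score: log-likelihood ratio; positive values favor the agent side.
    return agent_ll - customer_ll
```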
The scoring can be performed for the segments attributed to a certain side one or more at a time, or all of them together, using a combination of the feature vectors associated with the segments. If the segment group is not assigned to a specific side, a normalized score to one side which exceeds a certain threshold can be used in determining the side as well as the quality of the segmentation. Model association step 212 can be performed solely in order to associate a segment group with a certain side, and not just to assess segmentation quality, in which case it is not part of discriminative scoring step 208 but rather an independent step.

In step 232 the method further uses additional data evaluation, in order to evaluate the contribution of each segment attributed to a certain speaker. Additional data can include spotted words that are typical to a certain side, such as “how can I help you” on the agent side and “how much would that cost” on the customer side, CTI events, screen events, external or internal information, or the like. The presence, possibly associated with a certainty level, of such events in segments associated with a specific side is accumulated or otherwise combined into a single additional data score. The scores of statistical scoring 204, model association 212 and additional data scoring 232 are combined at step 236, and a general score is issued. If the score is below a predetermined threshold, as is evaluated at step 140 of FIG. 2, the segmentation process restarts at step 120, excluding the former first and second anchor segments. Since none of scoring steps 204, 212, and 232 is mandatory, combining step 236 weights whatever scores are available. Each subset of the scoring results of scoring steps 204, 212 and 232 can be used to produce a general scoring result. Combining step 236 can further be designed to weight additional scores, such as user input or other scoring mechanisms currently known or that will become known at a later time. Combining step 236 can use dynamic or predetermined parameters and schemes to weight or otherwise combine the available scores.
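The following Python sketch illustrates one possible realization of combining step 236, weighting whichever of the statistical, model association and additional data scores are available; the weights are illustrative assumptions and could equally be set dynamically as described above.

```python
# Illustrative weighted combination of the available scores.
def combine_scores(statistical=None, model_association=None, additional_data=None,
                   weights=(0.5, 0.3, 0.2)):
    available = [(score, weight)
                 for score, weight in zip((statistical, model_association, additional_data), weights)
                 if score is not None]
    if not available:
        raise ValueError("no scores available to combine")
    total_weight = sum(weight for _, weight in available)
    return sum(score * weight for score, weight in available) / total_weight

# Example: only the statistical and model association scores are available.
general_score = combine_scores(statistical=0.7, model_association=0.9)
```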

As mentioned above in relation to the statistical model scoring, and as is applicable for all types of data, the same data item should not be used in the scoring phase if it was already used during the segmentation phase. Using the same data item in the two phases will bias the results and give a higher and unjustified score to a certain segmentation. For example, if the phrase “Company X good morning” was spotted at a certain location, and the segment in which it appeared was used as an anchor for the agent side, considering this phrase during the additional data scoring step will raise the score in an artificial manner, since it is already known that the segment in which the phrase was said is associated with the agent side.

It will be appreciated by people skilled in the art that some of the presented methods and scorings can be partitioned in a different manner over the described steps without significant change in the results. It will also be appreciated by people skilled in the art that additional scoring methods can exist and be applied in addition to, or instead of, the presented scoring. The scoring method can be applied to the results of any segmentation method, and not necessarily the one presented above. Also, different variations can be applied to the segmentation and the scoring methods as described, without significant change to the proposed solution. It will further be appreciated by people skilled in the art that the disclosed invention can be extended to segmenting an interaction between more than two speakers, without significant changes to the described method. The described rules and parameters, such as the acceptable score values, stopping criteria for the segmentation and the like, can be predetermined or set dynamically. For example, the parameters can take into account the type or length of the interaction, the customer type as received from an external system, or the like.

The disclosed invention provides a novel approach to segmenting an audio interaction into segments, and associating each group of segments with one speaker. The disclosed invention further provides a scoring and control mechanism over the quality of the resulting segmentation.

It will be appreciated by persons skilled in the art that the present invention is not limited to what has been particularly shown and described hereinabove. Rather, the scope of the present invention is defined only by the claims which follow.

1. A speaker segmentation method for associating an at least one segment for each of at least two sides of an at least one audio interaction, with one of the at least two sides of the interaction using additional information, the method comprising: a segmentation step for associating the at least one segment with one side of the at least one interaction; and a scoring step for assigning a score to said segmentation.
2. The method of claim 1 wherein the additional information is at least one of the group consisting of: computer-telephony-integration information related to the at least one interaction; spotted words within the at least one interaction; data related to the at least one interaction; data related to a speaker thereof; external data related to the at least one interaction; or data related to at least one other interaction performed by a speaker of the at least one interaction.
3. The method of claim 1 further comprising a model association step for scoring the at least one segment against an at least one statistical model of one side, and obtaining a model association score.
4. The method of claim 1 wherein the scoring step uses discriminative information for discriminating the at least two sides of the interaction.
5. The method of claim 4 wherein the scoring step comprises a model association step for scoring the at least one segment against an at least one statistical model of one side, and obtaining a model association score.
6. The method of claim 5 wherein the scoring step further comprises a normalization step for normalizing the at least one model score.
7. The method of claim 4 wherein the scoring step comprises evaluating the association of the at least one segment with a side of the interaction using additional information.
8. The method of claim 7 wherein the additional information is at least one of the group consisting of: computer-telephony-integration information related to the at least one interaction; spotted words within the at least one interaction; data related to the at least one interaction; data related to a speaker thereof; external data related to the at least one interaction; or data related to at least one other interaction performed by a speaker of the at least one interaction.
9. The method of claim 1 wherein the scoring step comprises statistical scoring.
10. The method of claim 1 further comprising: a step of comparing said score to a threshold; and repeating the segmentation step and the scoring step if said score is below the threshold.
11. The method of claim 10 wherein the threshold is predetermined, or dynamic, or depends on: information associated with said at least one interaction, information associated with an at least one speaker thereof, or external information associated with the interaction.
12. The method of claim 1 wherein the segmentation step comprises: a parameterization step to transform the speech signal to a set of feature vectors in order to generate data more suitable for statistical modeling; an anchoring step for locating an anchor segment for each side of the interaction; and a modeling and classification step for associating at least one segment with one side of the interaction.
13. The method of claim 12 wherein the anchoring step or the modeling and classification step comprises using additional data.
14. The method of claim 13 wherein the additional data is one or more of the group consisting of: computer-telephony-integration information related to the at least one interaction; spotted words within the at least one interaction; data related to the at least one interaction; data related to a speaker thereof; external data related to the at least one interaction; or data related to at least one other interaction performed by a speaker of the at least one interaction.
15. The method of claim 1 further comprising a preprocessing step for enhancing the quality of the interaction.
16. The method of claim 1 further comprising a speech/non-speech segmentation step for eliminating non-speech segments from the interaction.
17. The method of claim 1 wherein the segmentation step comprises scoring the at least one segment with a voice model of a known speaker.
18. A speaker segmentation apparatus for associating an at least one segment for each of at least two speakers participating in an at least one audio interaction, with a side of the interaction, using additional information, the apparatus comprising: a segmentation component for associating an at least one segment within the interaction with one side of the at least one interaction; and a scoring component for assigning a score to said segmentation.
19. The apparatus of claim 18 wherein the additional information is at least one of the group consisting of: computer-telephony-integration information related to the at least one interaction; spotted words within the at least one interaction; data related to the at least one interaction; data related to a speaker thereof; external data related to the at least one interaction; or data related to at least one other interaction performed by a speaker of the at least one interaction.
20. A quality management apparatus for interaction-rich environments, the apparatus comprising: a capturing or logging component for capturing or logging an at least one audio interaction; a segmentation component for segmenting the at least one audio interaction; and a playback component for playing an at least one part of the at least one audio interaction.